Policy Gradient Methods for Reinforcement Learning with Function Approximation

نویسندگان

  • Richard S. Sutton
  • David A. McAllester
  • Satinder P. Singh
  • Yishay Mansour
چکیده

Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams’s REINFORCE method and actor–critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

University of Alberta Gradient Temporal - Difference Learning Algorithms

We present a new family of gradient temporal-difference (TD) learning methods with function approximation whose complexity, both in terms of memory and per-time-step computation, scales linearly with the number of learning parameters. TD methods are powerful prediction techniques, and with function approximation form a core part of modern reinforcement learning (RL). However, the most popular T...

متن کامل

Natural-Gradient Actor-Critic Algorithms

We prove the convergence of four new reinforcement learning algorithms based on the actorcritic architecture, on function approximation, and on natural gradients. Reinforcement learning is a class of methods for solving Markov decision processes from sample trajectories under lack of model information. Actor-critic reinforcement learning methods are online approximations to policy iteration in ...

متن کامل

A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation

We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradien...

متن کامل

Natural actor-critic algorithms

We present four new reinforcement learning algorithms based on actor–critic, natural-gradient and function-approximation ideas, and we provide their convergence proofs. Actor–critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochasti...

متن کامل

Reinforcement Learning for Humanoid Robotics

Reinforcement learning offers one of the most general framework to take traditional robotics towards true autonomy and versatility. However, applying reinforcement learning to high dimensional movement systems like humanoid robots remains an unsolved problem. In this paper, we discuss different approaches of reinforcement learning in terms of their applicability in humanoid robotics. Methods ca...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1999